EDA - Exploratory Data Analysis!
Analysis of data saved in the file student_lifestyle_dataset.csv downloaded from the Kaggle website.
This dataset, titled "Daily Lifestyle and Academic Performance of Students", contains data from 2,000 students collected via a Google Form survey. It includes information on study hours, extracurricular activities, sleep, socializing, physical activity, stress levels, and CGPA. The data covers an academic year from August 2023 to May 2024 and reflects student lifestyles primarily from India. This dataset can help analyze the impact of daily habits on academic performance and student well-being.
- File Format: CSV
- File Name: Daily_Lifestyle_and_Academic_Performance.csv
- Number of Records: 2000 rows
- Number of Columns: 8 columns
- Column Names: Student ID, Study Hours, Extracurricular Hours, Sleep Hours, Social Hours, Physical Activity Hours, Stress Level, GPA
- File Size: Approximately 150 KB
Student project.
In [5]:
# import of necessary libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import plotly.express as px
In [3]:
# defining student_df variable and reading the csv file into the DataFrame
student_df = pd.read_csv('student_lifestyle_dataset.csv', sep=",")
In [4]:
student_df
Out[4]:
| | Student_ID | Study_Hours_Per_Day | Extracurricular_Hours_Per_Day | Sleep_Hours_Per_Day | Social_Hours_Per_Day | Physical_Activity_Hours_Per_Day | GPA | Stress_Level |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 6.9 | 3.8 | 8.7 | 2.8 | 1.8 | 2.99 | Moderate |
| 1 | 2 | 5.3 | 3.5 | 8.0 | 4.2 | 3.0 | 2.75 | Low |
| 2 | 3 | 5.1 | 3.9 | 9.2 | 1.2 | 4.6 | 2.67 | Low |
| 3 | 4 | 6.5 | 2.1 | 7.2 | 1.7 | 6.5 | 2.88 | Moderate |
| 4 | 5 | 8.1 | 0.6 | 6.5 | 2.2 | 6.6 | 3.51 | High |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1995 | 1996 | 6.5 | 0.2 | 7.4 | 2.1 | 7.8 | 3.32 | Moderate |
| 1996 | 1997 | 6.3 | 2.8 | 8.8 | 1.5 | 4.6 | 2.65 | Moderate |
| 1997 | 1998 | 6.2 | 0.0 | 6.2 | 0.8 | 10.8 | 3.14 | Moderate |
| 1998 | 1999 | 8.1 | 0.7 | 7.6 | 3.5 | 4.1 | 3.04 | High |
| 1999 | 2000 | 9.0 | 1.7 | 7.3 | 3.1 | 2.9 | 3.58 | High |
2000 rows × 8 columns
General overview of the data.
In [6]:
student_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 8 columns):
 #   Column                           Non-Null Count  Dtype
---  ------                           --------------  -----
 0   Student_ID                       2000 non-null   int64
 1   Study_Hours_Per_Day              2000 non-null   float64
 2   Extracurricular_Hours_Per_Day    2000 non-null   float64
 3   Sleep_Hours_Per_Day              2000 non-null   float64
 4   Social_Hours_Per_Day             2000 non-null   float64
 5   Physical_Activity_Hours_Per_Day  2000 non-null   float64
 6   GPA                              2000 non-null   float64
 7   Stress_Level                     2000 non-null   object
dtypes: float64(6), int64(1), object(1)
memory usage: 125.1+ KB
In [7]:
student_df.head() # displaying initial values
Out[7]:
| | Student_ID | Study_Hours_Per_Day | Extracurricular_Hours_Per_Day | Sleep_Hours_Per_Day | Social_Hours_Per_Day | Physical_Activity_Hours_Per_Day | GPA | Stress_Level |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 6.9 | 3.8 | 8.7 | 2.8 | 1.8 | 2.99 | Moderate |
| 1 | 2 | 5.3 | 3.5 | 8.0 | 4.2 | 3.0 | 2.75 | Low |
| 2 | 3 | 5.1 | 3.9 | 9.2 | 1.2 | 4.6 | 2.67 | Low |
| 3 | 4 | 6.5 | 2.1 | 7.2 | 1.7 | 6.5 | 2.88 | Moderate |
| 4 | 5 | 8.1 | 0.6 | 6.5 | 2.2 | 6.6 | 3.51 | High |
In [8]:
student_df.tail() # displaying final values
Out[8]:
| | Student_ID | Study_Hours_Per_Day | Extracurricular_Hours_Per_Day | Sleep_Hours_Per_Day | Social_Hours_Per_Day | Physical_Activity_Hours_Per_Day | GPA | Stress_Level |
|---|---|---|---|---|---|---|---|---|
| 1995 | 1996 | 6.5 | 0.2 | 7.4 | 2.1 | 7.8 | 3.32 | Moderate |
| 1996 | 1997 | 6.3 | 2.8 | 8.8 | 1.5 | 4.6 | 2.65 | Moderate |
| 1997 | 1998 | 6.2 | 0.0 | 6.2 | 0.8 | 10.8 | 3.14 | Moderate |
| 1998 | 1999 | 8.1 | 0.7 | 7.6 | 3.5 | 4.1 | 3.04 | High |
| 1999 | 2000 | 9.0 | 1.7 | 7.3 | 3.1 | 2.9 | 3.58 | High |
In [9]:
student_df.sample(15) # displaying 15 random records
Out[9]:
| | Student_ID | Study_Hours_Per_Day | Extracurricular_Hours_Per_Day | Sleep_Hours_Per_Day | Social_Hours_Per_Day | Physical_Activity_Hours_Per_Day | GPA | Stress_Level |
|---|---|---|---|---|---|---|---|---|
| 700 | 701 | 6.7 | 2.0 | 8.1 | 5.5 | 1.7 | 3.11 | Moderate |
| 1860 | 1861 | 6.2 | 4.0 | 6.0 | 3.5 | 4.3 | 2.88 | Moderate |
| 1650 | 1651 | 6.3 | 3.2 | 9.7 | 0.7 | 4.1 | 2.65 | Moderate |
| 224 | 225 | 8.9 | 0.4 | 8.8 | 3.2 | 2.7 | 3.32 | High |
| 1115 | 1116 | 6.1 | 2.9 | 6.6 | 3.0 | 5.4 | 3.01 | Moderate |
| 848 | 849 | 10.0 | 1.0 | 9.4 | 3.0 | 0.6 | 3.40 | High |
| 6 | 7 | 8.0 | 0.7 | 5.3 | 5.7 | 4.3 | 3.08 | High |
| 1984 | 1985 | 8.5 | 0.3 | 7.1 | 3.4 | 4.7 | 3.23 | High |
| 1870 | 1871 | 5.6 | 3.3 | 6.3 | 4.7 | 4.1 | 2.85 | Low |
| 861 | 862 | 8.7 | 0.1 | 7.9 | 5.6 | 1.7 | 3.30 | High |
| 1024 | 1025 | 7.6 | 4.0 | 8.1 | 1.6 | 2.7 | 3.02 | Moderate |
| 466 | 467 | 7.2 | 1.2 | 6.3 | 1.2 | 8.1 | 3.05 | Moderate |
| 1717 | 1718 | 7.4 | 1.2 | 7.6 | 2.3 | 5.5 | 3.03 | Moderate |
| 153 | 154 | 7.1 | 2.9 | 9.7 | 4.0 | 0.3 | 3.33 | Moderate |
| 1877 | 1878 | 8.0 | 2.0 | 8.2 | 5.6 | 0.2 | 3.40 | Moderate |
In [10]:
student_df.describe() # displaying statistics for numeric columns
Out[10]:
| | Student_ID | Study_Hours_Per_Day | Extracurricular_Hours_Per_Day | Sleep_Hours_Per_Day | Social_Hours_Per_Day | Physical_Activity_Hours_Per_Day | GPA |
|---|---|---|---|---|---|---|---|
| count | 2000.000000 | 2000.000000 | 2000.000000 | 2000.000000 | 2000.000000 | 2000.00000 | 2000.000000 |
| mean | 1000.500000 | 7.475800 | 1.990100 | 7.501250 | 2.704550 | 4.32830 | 3.115960 |
| std | 577.494589 | 1.423888 | 1.155855 | 1.460949 | 1.688514 | 2.51411 | 0.298674 |
| min | 1.000000 | 5.000000 | 0.000000 | 5.000000 | 0.000000 | 0.00000 | 2.240000 |
| 25% | 500.750000 | 6.300000 | 1.000000 | 6.200000 | 1.200000 | 2.40000 | 2.900000 |
| 50% | 1000.500000 | 7.400000 | 2.000000 | 7.500000 | 2.600000 | 4.10000 | 3.110000 |
| 75% | 1500.250000 | 8.700000 | 3.000000 | 8.800000 | 4.100000 | 6.10000 | 3.330000 |
| max | 2000.000000 | 10.000000 | 4.000000 | 10.000000 | 6.000000 | 13.00000 | 4.000000 |
Preliminary observations.
According to the analyzed data set, students study about 7.5 hours a day on average (mean 7.48), with a minimum of 5 and a maximum of 10 hours.
Extracurricular activities take about 2 hours a day on average, with a maximum of 4 hours.
Students sleep about 7.5 hours on average, with the shortest sleep time being 5 hours and the longest 10 hours. Social time averages about 2.7 hours a day.
They spend an average of about 4.3 hours a day on physical activity, with a maximum of 13 hours.
The highest GPA is 4.00, the lowest is 2.24, and the mean is 3.12.
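The figures quoted in these observations come straight from `describe()`. As a minimal sketch of how to pull out just the mean, minimum and maximum per variable, the block below rebuilds the ten rows shown by `head()` and `tail()` above. The names `sample_df`, `measurements` and `summary` are illustrative, and because only ten of the 2,000 rows are used, the printed numbers will not match the full-data statistics.

```python
import pandas as pd

# Ten rows copied from the head() and tail() outputs above. The full file has
# 2,000 rows, so the numbers printed here are illustrative only.
sample_df = pd.DataFrame({
    "Student_ID": [1, 2, 3, 4, 5, 1996, 1997, 1998, 1999, 2000],
    "Study_Hours_Per_Day": [6.9, 5.3, 5.1, 6.5, 8.1, 6.5, 6.3, 6.2, 8.1, 9.0],
    "Extracurricular_Hours_Per_Day": [3.8, 3.5, 3.9, 2.1, 0.6, 0.2, 2.8, 0.0, 0.7, 1.7],
    "Sleep_Hours_Per_Day": [8.7, 8.0, 9.2, 7.2, 6.5, 7.4, 8.8, 6.2, 7.6, 7.3],
    "Social_Hours_Per_Day": [2.8, 4.2, 1.2, 1.7, 2.2, 2.1, 1.5, 0.8, 3.5, 3.1],
    "Physical_Activity_Hours_Per_Day": [1.8, 3.0, 4.6, 6.5, 6.6, 7.8, 4.6, 10.8, 4.1, 2.9],
    "GPA": [2.99, 2.75, 2.67, 2.88, 3.51, 3.32, 2.65, 3.14, 3.04, 3.58],
    "Stress_Level": ["Moderate", "Low", "Low", "Moderate", "High",
                     "Moderate", "Moderate", "Moderate", "High", "High"],
})

# Keep numeric measurements only: Student_ID is an identifier, Stress_Level is text.
measurements = sample_df.drop(columns=["Student_ID", "Stress_Level"])

# One row per variable, with the three statistics quoted in the observations.
summary = measurements.agg(["mean", "min", "max"]).T.round(2)
print(summary)
```

On the real data the same two lines can be applied to `student_df` instead of `sample_df`.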
Missing value analysis.
In [11]:
student_df.isnull().sum()
Out[11]:
Student_ID                         0
Study_Hours_Per_Day                0
Extracurricular_Hours_Per_Day      0
Sleep_Hours_Per_Day                0
Social_Hours_Per_Day               0
Physical_Activity_Hours_Per_Day    0
GPA                                0
Stress_Level                       0
dtype: int64
In [12]:
student_df[student_df.duplicated()] # displaying duplicates
Out[12]:
| Student_ID | Study_Hours_Per_Day | Extracurricular_Hours_Per_Day | Sleep_Hours_Per_Day | Social_Hours_Per_Day | Physical_Activity_Hours_Per_Day | GPA | Stress_Level |
|---|---|---|---|---|---|---|---|
The analyzed data set has no missing values and no duplicate rows.
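The two checks above can be folded into one small report. The sketch below is only an illustration: `data_quality_report` is a hypothetical helper, and `demo_df` is a made-up four-row frame (not taken from the dataset) that deliberately contains one missing value and one repeated row so that the counts are non-zero. It also counts repeated identifiers, which full-row `duplicated()` does not catch when the other columns differ.

```python
import pandas as pd


def data_quality_report(df: pd.DataFrame, id_column: str) -> dict:
    """Count missing cells, fully duplicated rows and repeated identifiers."""
    return {
        "missing_cells": int(df.isnull().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        "duplicate_ids": int(df[id_column].duplicated().sum()),
    }


# Made-up demo data (NOT from the dataset): row 3 repeats row 2,
# and the last row has a missing GPA.
demo_df = pd.DataFrame({
    "Student_ID": [1, 2, 2, 3],
    "GPA": [2.99, 2.75, 2.75, None],
})

report = data_quality_report(demo_df, id_column="Student_ID")
print(report)
```

Called as `data_quality_report(student_df, "Student_ID")`, all three counts should be zero if the conclusion above holds and no identifier is repeated.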
Single variable analysis.
In [13]:
sns.displot(data = student_df, x = "Study_Hours_Per_Day", col = "Stress_Level", kde=True)
#distribution of numerical variables divided into stress levels
Out[13]:
<seaborn.axisgrid.FacetGrid at 0x207e41b5550>
In [14]:
sns.displot(data=student_df, x="Extracurricular_Hours_Per_Day", col = "Stress_Level", kde=True)
Out[14]:
<seaborn.axisgrid.FacetGrid at 0x207ec8acc50>
In [15]:
sns.displot(data=student_df, x="Sleep_Hours_Per_Day", col = "Stress_Level", kde=True)
Out[15]:
<seaborn.axisgrid.FacetGrid at 0x207ec4fb1d0>
In [16]:
sns.displot(data=student_df, x="Social_Hours_Per_Day", col = "Stress_Level", kde=True)
Out[16]:
<seaborn.axisgrid.FacetGrid at 0x207ed2b3490>
In [17]:
sns.displot(data=student_df, x="Physical_Activity_Hours_Per_Day", col = "Stress_Level", kde=True)
Out[17]:
<seaborn.axisgrid.FacetGrid at 0x207ef064bd0>
In [18]:
sns.displot(data=student_df, x="GPA", col = "Stress_Level", kde=True)
Out[18]:
<seaborn.axisgrid.FacetGrid at 0x207ee8158d0>
In [19]:
student_df['Stress_Level'].value_counts()
Out[19]:
High        1029
Moderate     674
Low          297
Name: Stress_Level, dtype: int64
In [20]:
student_df[student_df['GPA']==4.0] # displaying the student with the highest GPA
Out[20]:
| | Student_ID | Study_Hours_Per_Day | Extracurricular_Hours_Per_Day | Sleep_Hours_Per_Day | Social_Hours_Per_Day | Physical_Activity_Hours_Per_Day | GPA | Stress_Level |
|---|---|---|---|---|---|---|---|---|
| 51 | 52 | 9.0 | 2.6 | 8.5 | 3.1 | 0.8 | 4.0 | High |
In [21]:
student_df[student_df['GPA']==2.24] # displaying the student with the lowest GPA
Out[21]:
| | Student_ID | Study_Hours_Per_Day | Extracurricular_Hours_Per_Day | Sleep_Hours_Per_Day | Social_Hours_Per_Day | Physical_Activity_Hours_Per_Day | GPA | Stress_Level |
|---|---|---|---|---|---|---|---|---|
| 764 | 765 | 5.5 | 1.8 | 6.7 | 5.2 | 4.8 | 2.24 | Low |
Short observations.
Just over half of the surveyed students (1,029 of 2,000, about 51%) report a high level of stress. A low stress level occurs in only 297 students (about 15%).
Low-stress students spend about 5-6 hours a day studying. In the moderate-stress group, study time is above 5 hours but below 9 hours a day.
In the high-stress group the range is wide, from 5 to 10 hours of study a day, although most of these students study for about 9 hours.
Sleep of less than 6 hours a day is observed only in the high-stress group. It cannot be the majority of that group, however: the lower quartile of sleep across all students is 6.2 hours, so at most about 500 of the 2,000 students sleep less than 6 hours, which is fewer than half of the 1,029 high-stress students.
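Claims like these are easier to check with a grouped summary than by reading the distribution plots. The sketch below is a minimal example on the ten rows shown by `head()` and `tail()`; the column names of `stress_summary` are illustrative, and with so few rows the numbers say nothing about the full data set.

```python
import pandas as pd

# Ten rows copied from the head() and tail() outputs above (illustrative only).
sample_df = pd.DataFrame({
    "Study_Hours_Per_Day": [6.9, 5.3, 5.1, 6.5, 8.1, 6.5, 6.3, 6.2, 8.1, 9.0],
    "Sleep_Hours_Per_Day": [8.7, 8.0, 9.2, 7.2, 6.5, 7.4, 8.8, 6.2, 7.6, 7.3],
    "Stress_Level": ["Moderate", "Low", "Low", "Moderate", "High",
                     "Moderate", "Moderate", "Moderate", "High", "High"],
})

# One row per stress level: group size, typical study time, and how common
# short sleep (under 6 hours) is inside the group.
stress_summary = sample_df.groupby("Stress_Level").agg(
    students=("Study_Hours_Per_Day", "count"),
    median_study_hours=("Study_Hours_Per_Day", "median"),
    min_sleep_hours=("Sleep_Hours_Per_Day", "min"),
    share_sleeping_under_6h=("Sleep_Hours_Per_Day", lambda s: (s < 6).mean()),
)

# Share of all students that falls into each stress level.
stress_summary["share_of_students"] = stress_summary["students"] / len(sample_df)
print(stress_summary)
```

Running the same aggregation on `student_df` gives the exact share of short sleepers per stress level, which is the number the last observation depends on.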
Analysis of relationships between variables.
In [22]:
plt.scatter('Study_Hours_Per_Day', 'GPA', data=student_df)
plt.xlabel('Study_Hours_Per_Day')
plt.ylabel('GPA')
plt.title('Relationship between study hours and grade point average.')
plt.show()
The longer the time spent studying, the higher the average grade tends to be.
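The visual impression from the scatter plot can be backed by a number. As a minimal sketch, the block below computes the Pearson correlation and a least-squares line for study hours against GPA on the ten rows shown by `head()` and `tail()`. The variable names are illustrative, and values computed on ten rows are not the values for the full data set.

```python
import numpy as np
import pandas as pd

# Ten rows copied from the head() and tail() outputs above (illustrative only).
sample_df = pd.DataFrame({
    "Study_Hours_Per_Day": [6.9, 5.3, 5.1, 6.5, 8.1, 6.5, 6.3, 6.2, 8.1, 9.0],
    "GPA": [2.99, 2.75, 2.67, 2.88, 3.51, 3.32, 2.65, 3.14, 3.04, 3.58],
})

# Pearson correlation: +1 is a perfect increasing line, 0 means no linear trend.
r = sample_df["Study_Hours_Per_Day"].corr(sample_df["GPA"])

# Least-squares line GPA = slope * study_hours + intercept.
slope, intercept = np.polyfit(
    sample_df["Study_Hours_Per_Day"], sample_df["GPA"], deg=1
)

print(f"r = {r:.2f}, slope = {slope:.3f} GPA points per extra study hour")
```

A positive `r` and a positive `slope` are what the observation above describes. Neither says that studying longer causes a higher GPA.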
In [23]:
sns.relplot(
data=student_df,
x="GPA",
y="Study_Hours_Per_Day",
col="Stress_Level",
hue="Stress_Level",
)
Out[23]:
<seaborn.axisgrid.FacetGrid at 0x207efaa09d0>
In [24]:
sns.lmplot(data=student_df, x="GPA", y="Study_Hours_Per_Day", col="Stress_Level", hue="Stress_Level")
Out[24]:
<seaborn.axisgrid.FacetGrid at 0x207f1e49b10>
In [25]:
sns.jointplot(data=student_df, x="Study_Hours_Per_Day", y="GPA", hue="Stress_Level")
Out[25]:
<seaborn.axisgrid.JointGrid at 0x207ec773750>
In [26]:
fig = px.scatter(student_df, x="GPA", y="Extracurricular_Hours_Per_Day")
fig.update_layout(
title="Relationship between extracurricular activities and grade point average",
xaxis_title="GPA",
yaxis_title="Extracurricular_Hours_Per_Day",
)
fig.show()
In [27]:
plt.scatter('Sleep_Hours_Per_Day', 'GPA', data=student_df)
plt.xlabel('Sleep_Hours_Per_Day')
plt.ylabel('GPA')
plt.title('Relationship between sleep hours and grade point average.')
plt.show()
In [28]:
sns.relplot(
data=student_df,
x="GPA",
y="Sleep_Hours_Per_Day",
col="Stress_Level",
hue="Stress_Level",
)
Out[28]:
<seaborn.axisgrid.FacetGrid at 0x207f4bfe590>
In the group of students with high levels of stress, we notice that low average grades go hand in hand with short sleep time.
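A per-group correlation is one way to test whether this pattern is specific to the high-stress group. The helper `groupwise_corr` below is a hypothetical sketch, not part of pandas or seaborn. It returns NaN for groups smaller than `min_rows`, an assumed threshold, because a correlation on two points is always plus or minus one. The data are the ten rows shown by `head()` and `tail()`, so the output is illustrative only.

```python
import pandas as pd


def groupwise_corr(df, group_col, x_col, y_col, min_rows=3):
    """Pearson correlation of x_col and y_col inside each level of group_col."""
    results = {}
    for level, group in df.groupby(group_col):
        if len(group) < min_rows:
            # Too few points for a meaningful correlation.
            results[level] = float("nan")
        else:
            results[level] = group[x_col].corr(group[y_col])
    return pd.Series(results, name=f"corr({x_col}, {y_col})")


# Ten rows copied from the head() and tail() outputs above (illustrative only).
sample_df = pd.DataFrame({
    "Sleep_Hours_Per_Day": [8.7, 8.0, 9.2, 7.2, 6.5, 7.4, 8.8, 6.2, 7.6, 7.3],
    "GPA": [2.99, 2.75, 2.67, 2.88, 3.51, 3.32, 2.65, 3.14, 3.04, 3.58],
    "Stress_Level": ["Moderate", "Low", "Low", "Moderate", "High",
                     "Moderate", "Moderate", "Moderate", "High", "High"],
})

sleep_gpa_corr = groupwise_corr(
    sample_df, "Stress_Level", "Sleep_Hours_Per_Day", "GPA"
)
print(sleep_gpa_corr)
```

On `student_df`, a clearly positive value for the high-stress group and values near zero for the other groups would support the observation.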
In [29]:
sns.relplot(
data=student_df,
x="Sleep_Hours_Per_Day",
y="Study_Hours_Per_Day",
col="Stress_Level",
hue="GPA",
size="GPA",
)
Out[29]:
<seaborn.axisgrid.FacetGrid at 0x207f4c103d0>
In [30]:
plt.scatter('Social_Hours_Per_Day', 'GPA', data=student_df)
plt.xlabel('Social_Hours_Per_Day')
plt.ylabel('GPA')
plt.title('Relationship between social activity hours and grade point average.')
plt.show()
In [31]:
sns.relplot(
data=student_df,
x="Study_Hours_Per_Day",
y="Social_Hours_Per_Day",
col="Stress_Level",
hue="GPA",
#size="Stress_Level",
)
Out[31]:
<seaborn.axisgrid.FacetGrid at 0x207f597b750>
In [32]:
plt.scatter('Physical_Activity_Hours_Per_Day', 'GPA', data=student_df)
plt.xlabel('Physical_Activity_Hours_Per_Day')
plt.ylabel('GPA')
plt.title('Relationship between physical activity hours and grade point average.')
plt.show()
In [33]:
sns.jointplot(data=student_df, x="Social_Hours_Per_Day", y="Physical_Activity_Hours_Per_Day", hue="Stress_Level")
Out[33]:
<seaborn.axisgrid.JointGrid at 0x207f512c4d0>
In [34]:
sns.relplot(
data=student_df,
x="Social_Hours_Per_Day",
y="Physical_Activity_Hours_Per_Day",
col="Stress_Level",
hue="GPA",
size="GPA",
)
Out[34]:
<seaborn.axisgrid.FacetGrid at 0x207f5b209d0>
Mainly in the low- and moderate-stress groups, social time decreases as the hours devoted to physical activity increase. In the high-stress group, by contrast, students who devote more time to physical and social activities tend to have a lower GPA.
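One possible reason for a trade-off between activities is that the five time columns share a fixed daily budget. The dataset description does not state this, so it is an assumption, but in the ten rows shown by `head()` and `tail()` the five columns add up to 24 hours. The sketch below performs that check; `time_columns` and `Total_Hours` are illustrative names.

```python
import numpy as np
import pandas as pd

# Ten rows copied from the head() and tail() outputs above (illustrative only).
sample_df = pd.DataFrame({
    "Study_Hours_Per_Day": [6.9, 5.3, 5.1, 6.5, 8.1, 6.5, 6.3, 6.2, 8.1, 9.0],
    "Extracurricular_Hours_Per_Day": [3.8, 3.5, 3.9, 2.1, 0.6, 0.2, 2.8, 0.0, 0.7, 1.7],
    "Sleep_Hours_Per_Day": [8.7, 8.0, 9.2, 7.2, 6.5, 7.4, 8.8, 6.2, 7.6, 7.3],
    "Social_Hours_Per_Day": [2.8, 4.2, 1.2, 1.7, 2.2, 2.1, 1.5, 0.8, 3.5, 3.1],
    "Physical_Activity_Hours_Per_Day": [1.8, 3.0, 4.6, 6.5, 6.6, 7.8, 4.6, 10.8, 4.1, 2.9],
})

time_columns = list(sample_df.columns)

# Total hours accounted for per student.
sample_df["Total_Hours"] = sample_df[time_columns].sum(axis=1)

# np.isclose tolerates floating-point rounding in the sums.
sums_to_24 = bool(np.isclose(sample_df["Total_Hours"], 24.0).all())
print(sample_df["Total_Hours"].round(1).tolist(), sums_to_24)
```

If the same check holds on the whole of `student_df`, an hour added to one activity must come out of another, so some negative relationships between time columns are built into the data and need not reflect a behavioural pattern.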
In [35]:
sns.relplot(
data=student_df,
kind="line",
x="Sleep_Hours_Per_Day",
y="Physical_Activity_Hours_Per_Day",
style="Stress_Level",
hue="Stress_Level",
)
Out[35]:
<seaborn.axisgrid.FacetGrid at 0x207f5c42450>
In [36]:
correlation_df = student_df[['Study_Hours_Per_Day','Extracurricular_Hours_Per_Day','Sleep_Hours_Per_Day', 'Social_Hours_Per_Day', 'Physical_Activity_Hours_Per_Day', 'GPA',]].corr()
In [37]:
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_df, annot=True, cmap='coolwarm', linewidths=.5)
plt.title('Heatmap showing correlations between variables')
plt.show()
There is a noticeable positive correlation between the time spent studying and the average grade.
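The heatmap shows every pair of variables at once. When the question is only which variables move with GPA, the GPA column of the correlation matrix can be ranked by absolute value. The sketch below does this on the ten rows shown by `head()` and `tail()`; `gpa_ranking` is an illustrative name and the ranking on ten rows need not match the ranking on the full data set. `DataFrame.corr()` uses the Pearson coefficient by default.

```python
import pandas as pd

# Ten rows copied from the head() and tail() outputs above (illustrative only).
sample_df = pd.DataFrame({
    "Study_Hours_Per_Day": [6.9, 5.3, 5.1, 6.5, 8.1, 6.5, 6.3, 6.2, 8.1, 9.0],
    "Extracurricular_Hours_Per_Day": [3.8, 3.5, 3.9, 2.1, 0.6, 0.2, 2.8, 0.0, 0.7, 1.7],
    "Sleep_Hours_Per_Day": [8.7, 8.0, 9.2, 7.2, 6.5, 7.4, 8.8, 6.2, 7.6, 7.3],
    "Social_Hours_Per_Day": [2.8, 4.2, 1.2, 1.7, 2.2, 2.1, 1.5, 0.8, 3.5, 3.1],
    "Physical_Activity_Hours_Per_Day": [1.8, 3.0, 4.6, 6.5, 6.6, 7.8, 4.6, 10.8, 4.1, 2.9],
    "GPA": [2.99, 2.75, 2.67, 2.88, 3.51, 3.32, 2.65, 3.14, 3.04, 3.58],
})

# Take the GPA column of the correlation matrix, drop GPA's correlation with
# itself, and sort so the strongest relationship (either sign) comes first.
gpa_ranking = (
    sample_df.corr()["GPA"]
    .drop("GPA")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(gpa_ranking.round(2))
```

The same expression applied to `correlation_df["GPA"]` ranks the variables for the full data set.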
In [38]:
sns.pairplot(data=student_df, hue="Stress_Level")
Out[38]:
<seaborn.axisgrid.PairGrid at 0x207f5b8de10>
Outlier analysis.
In [39]:
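# Student_ID is an identifier, not a measurement; left in, its 1-2000 range would dominate the shared y-axis
student_df.drop(columns='Student_ID').groupby('Stress_Level').plot(kind='box', figsize=(20,8), grid=True)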
Out[39]:
Stress_Level
High        Axes(0.125,0.11;0.775x0.77)
Low         Axes(0.125,0.11;0.775x0.77)
Moderate    Axes(0.125,0.11;0.775x0.77)
dtype: object
Outliers appear in GPA, Study Hours and Physical Activity Hours.
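By default, matplotlib and seaborn box plots draw points beyond 1.5 times the interquartile range (IQR) from the quartiles as outliers. As a rough cross-check, the sketch below applies that rule to the pooled quartiles, minimum and maximum copied from the `describe()` output above. `tukey_fences` and `has_outliers` are illustrative names. The box plots in this section are split by stress level and use per-group quartiles, so they can flag points that the pooled fences do not.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) limits beyond which a value counts as an outlier."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# min, 25%, 75% and max for all 2,000 students, copied from describe() above.
described = {
    "Study_Hours_Per_Day": {"min": 5.0, "q1": 6.3, "q3": 8.7, "max": 10.0},
    "Physical_Activity_Hours_Per_Day": {"min": 0.0, "q1": 2.4, "q3": 6.1, "max": 13.0},
    "GPA": {"min": 2.24, "q1": 2.90, "q3": 3.33, "max": 4.00},
}

has_outliers = {}
for column, stats in described.items():
    lower, upper = tukey_fences(stats["q1"], stats["q3"])
    has_outliers[column] = {
        # If the extreme value lies beyond a fence, at least one outlier exists.
        "below": stats["min"] < lower,
        "above": stats["max"] > upper,
    }
    print(f"{column}: fences=({lower:.3f}, {upper:.3f}) {has_outliers[column]}")
```

With the pooled quartiles, GPA has outliers on both sides and physical activity only on the high side, while study hours has none. Any study-hours outliers seen in the box plots therefore arise only within individual stress groups.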
In [40]:
plt.figure(figsize=(8, 6))
sns.boxplot(data=student_df, x='Stress_Level', y='GPA', hue='Stress_Level')
plt.title('GPA by stress level', fontsize=16)
plt.xlabel('Stress_Level', fontsize=12)
plt.ylabel('GPA', fontsize=12)
plt.grid(axis='y', linestyle='--', alpha=0.6)
plt.tight_layout()
plt.show()
In [41]:
plt.figure(figsize=(8, 6))
sns.boxplot(data=student_df, x='Stress_Level', y='Physical_Activity_Hours_Per_Day', hue='Stress_Level')
plt.title('Physical activity hours per day by stress level', fontsize=16)
plt.xlabel('Stress_Level', fontsize=12)
plt.ylabel('Physical_Activity_Hours_Per_Day', fontsize=12)
plt.grid(axis='y', linestyle='--', alpha=0.6)
plt.tight_layout()
plt.show()
In [42]:
plt.figure(figsize=(8, 6))
sns.boxplot(data=student_df, x='Stress_Level', y='Study_Hours_Per_Day', hue="Stress_Level" )
plt.title('Study hours per day by stress level', fontsize=16)
plt.xlabel('Stress_Level', fontsize=12)
plt.ylabel('Study_Hours_Per_Day', fontsize=12)
plt.grid(axis='y', linestyle='--', alpha=0.6)
plt.tight_layout()
plt.show()